The rapid growth of spatial-temporal data is advancing analysts’ intelligence capabilities. With the abundance of available GPS data, analysts can study an individual’s pattern of life and inform resource allocation decisions. Visibility on when an individual stops, how long they stop, and when they are in transit can aid in characterizing an area of interest, especially in areas where other information is sparse. Analysts currently use spatial density-based clustering techniques to glean insights from large GPS data sets, which has proved useful in characterizing areas of interest by identifying locations where there is a high number of observations occurring relatively close together. Loiter information, however, can only be analyzed by adding a time component to the clustering. This tutorial offers a method to employ Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) on both spatial and temporal information. We found that clustering with HDBSCAN spatially and temporally can overcome the known disadvantages of common density-based clustering methods and yield reliable loiter information to inform a pattern of life.
The challenge in using spatial-temporal data is correctly labeling each observation as moving or loitering (non-moving). The main assumption common to all loiter inference algorithms is that when an individual is loitering, their observations will be relatively close geographically. In contrast, when an individual is moving, the distances between consecutive observations will be comparatively larger. Estimating this relative spatial difference is the key to classifying an observation as loitering or moving. Analysts employ well-established unsupervised clustering algorithms to detect possible loiter locations from GPS data, such as time-based clustering (Hariharan and Toyama 2004) and (Liu, Wolfson, and Yin 2006), centroid-based k-means clustering (Ashbrook and Starner 2002), and various density-based algorithms such as DBSCAN (Ester, Kriegel, and Xu, n.d.), ST-DBSCAN (Birant and Kut 2007), T-DBSCAN (Chen, Ji, and Wang 2014), and HDBSCAN (Campello et al. 2015).
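The distance-based intuition above can be sketched directly. The snippet below is an illustrative sketch, not part of the tutorial's code: it computes the haversine distance between consecutive GPS fixes and labels each step against a hypothetical 200-meter threshold. The coordinates and threshold are made up for illustration.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in meters."""
    R = 6371000.0  # mean Earth radius
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

def label_moving(track, threshold_m=200.0):
    """Label each observation after the first as 'moving' or 'loitering'
    by comparing the distance to the previous fix against a threshold."""
    labels = []
    for (lat0, lon0), (lat1, lon1) in zip(track, track[1:]):
        step = haversine_m(lat0, lon0, lat1, lon1)
        labels.append("moving" if step > threshold_m else "loitering")
    return labels

# A ~13 m step followed by a ~1.3 km jump
track = [(56.0, -111.0), (56.0001, -111.0001), (56.01, -111.01)]
print(label_moving(track))  # ['loitering', 'moving']
```

Real data is noisier than this, which is why the tutorial estimates the spatial threshold from the data rather than fixing it by hand.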
Time-based clustering methods are used when movement triggers observations. This method determines possible loiter locations by finding observations within a defined maximum spatial distance over a period longer than a defined minimum time interval (Hariharan and Toyama 2004) and (Liu, Wolfson, and Yin 2006). This method has been shown to succeed in finding clusters of tightly placed observations with large inter-observation time intervals, typical of a loiter location. However, analysts are not always afforded data on every individual movement. Without such comprehensive coverage, time intervals without observations are no guarantee of a lack of movement, and other clustering methods should be used. If observations are more of a random sample of individual locations, loiter locations might be found by density of points alone. The k-means method divides the data into \(k\) regions clustered as tightly as possible around their centroids, which could be candidate loiter locations (Ashbrook and Starner 2002). Unfortunately, this model can generate inaccurate or spurious candidate locations if \(k\) is not correctly chosen, and determining the proper \(k\) can be an iterative, subjective process. Additionally, the time dimension - an integral part of the definition of loiter information - is completely discarded by k-means.
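To make the k-means drawbacks concrete, here is a minimal Lloyd's-algorithm sketch (our illustration, not the tutorial's method and not the implementation in Ashbrook and Starner 2002). Note that only the spatial coordinates enter the distance computation, so the time dimension is discarded entirely, and the result depends on the choice of \(k\) and the starting centroids.

```python
from math import dist
from random import sample

def kmeans(points, k, iters=20, init=None):
    """Minimal Lloyd's-algorithm sketch: returns k centroids.
    Only the coordinates enter the distance, so time plays no role."""
    centroids = list(init) if init is not None else sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest centroid
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda c: dist(p, centroids[c]))].append(p)
        # Move each centroid to the mean of its assigned points
        centroids = [tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids

# Two tight blobs; with k = 2 and reasonable starts, centroids land on the blob centers
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
print(kmeans(pts, 2, init=[(0, 0), (10, 10)]))  # [(0.5, 0.5), (10.5, 10.5)]
```

Rerunning with a poorly chosen \(k\) (say 3) would split one blob into two spurious "loiter locations," which is exactly the failure mode described above.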
Density-based methods can identify clusters, including irregularly shaped ones that don’t necessarily stay close to their centroids, without needing to specify how many clusters to search for. DBSCAN takes two parameters: a maximum spatial distance \(\epsilon\) and the minimum number of observations required for a new cluster, \(minPts\). From these parameters, DBSCAN produces clusters of observations spaced within \(\epsilon\) spatial distance of each other (Ester, Kriegel, and Xu, n.d.). ST-DBSCAN and T-DBSCAN further discriminate clusters based on temporal distance, which avoids identifying non-loiter clusters such as roads or paths (Chen, Ji, and Wang 2014) and (Birant and Kut 2007). The disadvantage is that these algorithms effectively learn what density of observations to expect from the most dense clusters in the data, then ignore potential clusters with lower density. Since individuals loiter in locations of varying size and for varying time intervals, we expect that a characterization of all loiter locations in a GPS dataset would include clusters of varying density. HDBSCAN, which extends the DBSCAN algorithm to identify clusters of varying density, is more suitable for identifying loiter locations.
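To make the roles of \(\epsilon\) and \(minPts\) concrete, here is a minimal, self-contained DBSCAN sketch. This is illustrative only, not the library implementation used later in the tutorial, and the toy points are made up.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns one label per point
    (0 = noise, 1..k = cluster IDs)."""
    labels = [None] * len(points)

    def neighbors(i):
        # Epsilon-neighborhood of point i (includes i itself)
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = 0  # provisionally noise
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == 0:
                labels[j] = cluster  # previously-noise point becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:  # j is also a core point: expand the cluster
                queue.extend(n for n in more if labels[n] is None)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=3))  # [1, 1, 1, 2, 2, 2, 0]
```

With eps=2 and min_pts=3 the two tight blobs become clusters and the isolated point is labeled noise; shrinking eps to 0.5 labels every point noise. A single global \(\epsilon\) cannot serve clusters of different densities, which is the limitation HDBSCAN addresses.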
This tutorial demonstrates an implementation of the density-based clustering algorithm HDBSCAN to infer loiter information about an individual of interest from a GPS dataset. We begin by conducting collection analysis to understand the limitations of the available data. Then we compute data-informed HDBSCAN parameters and cluster spatially. Next, we examine each spatial cluster and cluster again temporally, to verify that the cluster is better understood as a loiter location than as a location frequently visited in transit. Finally, we process each cluster for loiter information and visualize our results on an interactive map.
This tutorial uses movement data from grey wolves in northeastern Alberta’s Athabasca Oil Sands Region, available here. The primary use of this data is in support of the Arctic Animal Movement Archive. The data consists of a large Comma Separated Values (csv) file recording (non-comprehensively) the locations of 46 wolves from March 17, 2012 to September 13, 2014. Further information on how this telemetry data was collected is available here. We use this data as a placeholder for GPS data on persons of interest.
We begin by reading in and processing the wolf movement data using the R package data.table (Dowle et al. 2019). We use this package extensively in this tutorial, taking advantage of its unique methods for large data exploration and manipulation. Check here for a very insightful introduction to data.table.
The tabs below demonstrate the Extraction Transformation and Load (ETL) process implemented in this tutorial.
The computation below leverages data.table’s fread function to read the csv and answer some preliminary exploratory questions concerning the data.
# Leverage the fread function from package data.table to quickly read in csv data as an R object
initData <- fread("data/ABoVE_ Boutin Alberta Grey Wolf.csv")

# How many wolves?
initData[, # all rows
         # data.table fast unique count for the column
         # labeling the different wolves
         uniqueN(`tag-local-identifier`)]

## [1] 43

# How big is the data set?
utils::object.size(initData)

## 32539288 bytes

# What are the dimensions?
dim(initData)

## [1] 239194 18

# View the first three rows
head(initData, 3) %>% formattable(align = "l") %>%
  formattable::as.datatable(rownames = FALSE,
                            options = list(
                              searching = FALSE,
                              scrollX = TRUE,
                              columnDefs = list(list(className = 'dt-left', targets = '_all'))))

The raw data set has 239,194 observations on 43 tagged wolves. We only need the following columns for our study:
The following computation extracts the data relevant to this study, and customizes the column labels for convenience.
# Load only the relevant columns
data <- fread("data/ABoVE_ Boutin Alberta Grey Wolf.csv",
              select = c("study-local-timestamp",
                         "tag-local-identifier",
                         "location-long",
                         "location-lat"),
              # Make sure data.table does not automatically generate factor columns
              stringsAsFactors = FALSE) %>%
  # Omit NAs in the data. Familiarization with how the data was collected is
  # necessary to consider retaining these values and making them useful
  na.omit()

# Set the column names for convenience
setnames(data,
         # Vector of old names that we want to change
         old = c("study-local-timestamp",
                 "tag-local-identifier",
                 "location-long",
                 "location-lat"),
         # Vector of new, more convenient names
         new = c("dtg", # date time group
                 "cid", # component ID
                 "lon", # longitude
                 "lat")) # latitude

This study focuses on making inferences about one individual, so we use one wolf as a surrogate. The computation below identifies a wolf of interest and subsets the data to only include this wolf’s observations.
# Use data.table indexing to determine the wolf with the most data
wolf.of.Interest <- data[, .(count = .N), by = "cid"][count == max(count)]$cid # The additional [] gives us which cid has the maximum count

# Subset to focus on one wolf:
dt_init <- data[cid == wolf.of.Interest]
m <- nrow(dt_init)

# Create datetime objects from dtg character strings
dt_init[, "dtg" := dtg %>% as.POSIXct(format = "%Y-%m-%d %H:%M:%S")]

# Order the data sequentially by date time group
setorder(dt_init, dtg)

# Set inter-obs time interval column
dt_init[, "timeInt" := dtg %>% difftime(shift(dtg), units = "secs")]
dt_init[, "timeInt" := ifelse(timeInt <= 24 * 3600, timeInt, NA)]

# Use the lubridate package to get the weekday from the date objects
dt_init[, "Weekday" := dtg %>% lubridate::wday(label = TRUE)]

# Get the hour from the date objects
dt_init[, "Hour" := lubridate::hour(dtg)]

# Get the time of day for each dtg
dt_init[, "time" := as.ITime(dtg)]

# Set group time label (for plotting)
dt_init$group <- cut(dt_init$time,
                     breaks = c(0, 6 * 3600, 9 * 3600, 12 * 3600,
                                15 * 3600, 18 * 3600, 21 * 3600, 24 * 3600),
                     labels = c("0000 - 0600", "0600 - 0900", "0900 - 1200",
                                "1200 - 1500", "1500 - 1800", "1800 - 2100",
                                "2100 - 2400"))

save(dt_init, file = "products/dt_init.RData")

Before implementing our clustering algorithm, we perform collection analysis to assess the feasibility of clustering and to measure the degree of any temporal collection bias. We use a novel statistical test to show that the clusters present in the wolf of interest’s data differ from the clusters seen in the wolf data as a whole. We find no significant evidence of weekday or time-of-day collection bias. We do, however, find significant collection bias across months and years: some time periods have very high collection volume, while others have little to no collection. The tabs below demonstrate our collection analysis.
We manipulate the data and generate a bivariate heatmap to study the collection volume on every weekday, for each hour of the day.
Figure 1: Collection Volume by Hour and Weekday
This chart shows that, besides a low level of collection on Monday afternoons and early Sunday mornings, collection is generally uniform across all weekdays and times of day. Further analysis is required to determine the statistical significance of how far this data departs from a bivariate uniform distribution. That analysis is omitted in this study, and we assume the data is generally free of hourly and weekday bias.
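The omitted significance check could be carried out with a Pearson chi-square goodness-of-fit test against a uniform distribution over the hour-by-weekday cells. The sketch below is our illustration, not part of the tutorial's code; it computes only the test statistic and degrees of freedom, which would then be compared against a \(\chi^2\) critical value.

```python
def chisq_uniform(counts):
    """Pearson chi-square statistic for departure from a uniform
    distribution over the given cells (e.g. flattened hour-by-weekday counts).
    Returns (statistic, degrees of freedom)."""
    n = sum(counts)
    k = len(counts)
    expected = n / k
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return stat, k - 1

# Perfectly uniform counts give a statistic of 0
print(chisq_uniform([10, 10, 10, 10]))  # (0.0, 3)
```

A large statistic relative to the critical value for the returned degrees of freedom would indicate collection bias across the cells.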
Next we check the collection volume over the time range of the data to judge the collection bias with respect to each day of the study. The computation below generates a scatter plot that conveys how the collection volume has changed over time.
Figure 2: Collection Volume over Time
The interactive chart reveals three spikes in collection volume: April 2012, March 2013, and January through March 2014, with drastically lower observation counts on the other dates. Also, there are no observations between May 2012 and March 2013, which can drastically skew the inter-observation time distribution. Unlike the relatively even distribution of collection across weekday and time of day, this chart conveys severe collection bias over the time range of the study, which means we cannot assume random collection with respect to year or season. Further analysis is required to investigate the cause of the collection bias. We continue our study, restricting ourselves to identifying locations where the wolf of interest is likely loitering, but mindful that these locations may be year- or season-dependent.
The wolf’s activity between observations is unknown, which can skew our results. We need to study the inter-observation times to judge how well the data represents the wolf’s locations.
Figure 3: Inter-observation Time Histogram
The histogram above shows that 95% of the inter-observation times are less than 15 minutes, which means that loiter locations with a dwell time of one hour may have as few as four observations in our data even if we have nominal coverage of the time period. This is typical of many GPS data sets analysts encounter. We can still make reasonable inferences, with the caveat that inferred loiter locations are only predictions based on the information available, and must be corroborated with other information sources.
We are now ready to begin extracting loiter information from the data. We implement HDBSCAN to first identify spatial clusters, then we cluster again temporally to find space-time clusters of interest. We then process these results and aggregate loiter information from our study, and convey the results in an interactive html map.
The next step in our analysis, before employing HDBSCAN, is to choose data-informed parameter values for the spatial and temporal \(cluster\_selection\_epsilon\) parameters. These parameters are used in ST-DBSCAN as the maximum spatial and temporal distances allowed for two observations to be clustered together (Birant and Kut 2007). In HDBSCAN, these parameters set the minimum distance considered when evaluating cluster densities, and allow us to group together clusters within this distance (Campello et al. 2015). This parameter prevents the algorithm from incorrectly labeling cluster boundary points as outliers when the cluster has a relatively high density. More information about this parameter is available here. This is a reliable technique for mitigating the known disadvantages of HDBSCAN by combining the algorithm with DBSCAN, while still identifying clusters of varying densities.
We determine optimal parameter values by computing the distance to each observation’s nearest neighbor, sorting these distances, then finding the distance at which the increase in nearest neighbor distance is most dramatic (Rahmah and Sitanggang 2016). More details about this method are discussed here, and a tutorial on how to implement this method in Python is available here. We find where the change in the nearest neighbor distance is most pronounced (also known as the “knee”) by finding the maximum curvature in the line plot (Satopaa et al. 2011), using the method described here and implemented in Python with the kneed module available here (Arvai 2020a).
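The nearest-neighbor distance curve itself takes only a few lines to compute. The sketch below uses Euclidean distance on made-up toy points for brevity; the tutorial's actual computation would use geographic distance on the observation coordinates.

```python
from math import dist

def sorted_nn_distances(points):
    """Distance from each point to its nearest neighbor, sorted ascending.
    The 'knee' of this curve is the candidate cluster_selection_epsilon."""
    nn = []
    for i, p in enumerate(points):
        nn.append(min(dist(p, q) for j, q in enumerate(points) if j != i))
    return sorted(nn)

# Three mutually close points and one far outlier: the curve is flat, then jumps
print(sorted_nn_distances([(0, 0), (0, 1), (1, 1), (10, 10)]))
```

The sorted vector is the y-axis and its index the x-axis for the knee-finding function defined next.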
We define the Python function below to calculate the maximum curvature of a line plot. We use it by setting the x-axis as the order vector and the y-axis as the distances vector, which produces a convex increasing curve. The y-value at the knee of the curve is the optimal parameter value for \(cluster\_selection\_epsilon\).
from kneed import KneeLocator

def findKnee(order, distances, smoothParam, interpMethod='polynomial'):
    kneedle = KneeLocator(order,
                          distances,
                          S=smoothParam,
                          interp_method=interpMethod,
                          curve='convex',
                          direction='increasing',
                          online=True)
    return kneedle.knee_y

As described here, we first calculate the spatial distance to the nearest neighbor for each observation and sort these values. We then find the maximum curvature, and plot the results.
Figure 4: Sorted Nearest Neighbor Spatial Distance (meters)
The plot above shows that ~200 meters is the distance just before the nearest neighbor distances rapidly increase. We store this value as epsSpat for use later.
We now apply the same method to get the optimal temporal epsilon distance.
Figure 5: Sorted Nearest Neighbor Temporal Distance (seconds)
The plot above shows that the optimal temporal epsilon distance is 12 seconds. Even though the inter-observation times are mostly between 10 and 15 minutes, the differences in time of day between observations are much smaller. We store this value as epsTime for use later.
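The pre-computed time distance matrix used for temporal clustering (timeDistMat in the code that follows) is built in code not shown in this excerpt. One plausible construction, which we sketch here as an assumption, uses a circular time-of-day metric in seconds so that observations on either side of midnight are treated as close.

```python
def tod_dist_matrix(seconds_of_day):
    """Pairwise circular time-of-day distances in seconds, so that
    23:59:50 and 00:00:10 are 20 s apart rather than 86,380 s."""
    day = 24 * 3600
    n = len(seconds_of_day)
    mat = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = abs(seconds_of_day[i] - seconds_of_day[j]) % day
            mat[i][j] = min(d, day - d)
    return mat

m = tod_dist_matrix([10, 86390, 43200])  # 00:00:10, 23:59:50, 12:00:00
print(m[0][1])  # 20
```

Whatever the exact construction, the matrix must be symmetric with a zero diagonal to be usable as a precomputed metric.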
Equipped with data-informed parameters, we are ready to cluster spatially and temporally using HDBSCAN. We define the Python function below, which uses the Python hdbscan library (McInnes 2020). The function clusters using the haversine distance when we cluster spatially, and clusters temporally when given a pre-computed time distance matrix.
# Imports
import numpy as np
import pandas as pd
import hdbscan

# Functions
def runHDB(points, minClusterSize, epsDist, minSampleSize=None, distMatrix=None, verbose=False):
    # Organize parameters
    if minSampleSize is None:
        minSampleSize = minClusterSize  # restore default
    # Define cluster object
    if distMatrix is not None:
        clusterFinder = hdbscan.HDBSCAN(min_cluster_size=int(minClusterSize),
                                        min_samples=int(minSampleSize),
                                        metric="precomputed",
                                        cluster_selection_epsilon=epsDist,
                                        cluster_selection_method="eom")
        X = distMatrix
    else:
        clusterFinder = hdbscan.HDBSCAN(min_cluster_size=int(minClusterSize),
                                        min_samples=int(minSampleSize),
                                        metric="haversine",
                                        # convert meters to radians on the unit sphere
                                        cluster_selection_epsilon=(epsDist / 1000) / 6371,
                                        cluster_selection_method="eom")
        X = np.radians(points)
    if verbose:
        print("Running HDBSCAN on {} observations".format(len(X)))
    res = clusterFinder.fit(X)
    y = res.labels_
    if verbose:
        print("Found {} clusters".format(pd.Series(y).max() + 1))
    # Shift labels so noise (-1) becomes 0 and clusters are numbered from 1
    return y + 1

The computation below applies HDBSCAN with the data-informed parameters for \(cluster\_selection\_epsilon\) to find spatial clusters, then runs again on each spatial cluster to find temporal limits within those clusters. We use the minimum allowable cluster size of 2 to minimize how “smooth” the cluster density estimate is between nearby points, which mitigates HDBSCAN’s tendency to identify overlapping clusters (“Outlier Detection - Help to Understand the Right Parameter Configuration. · Issue #116 · Scikit-Learn-Contrib/Hdbscan. GitHub” n.d.). We also use a high minimum sample size to discriminate noisy points from cluster boundary points. This method for choosing minimum cluster size and minimum sample size is discussed in greater detail here (“Outlier Detection - Help to Understand the Right Parameter Configuration. · Issue #116 · Scikit-Learn-Contrib/Hdbscan. GitHub” n.d.).
# Cluster spatially
Y <- runHDB(points = dt[, c("lon", "lat")],
            minClusterSize = 2,
            minSampleSize = 100,
            epsDist = epsSpat)
dt[, "spatClus" := Y]

# Cluster temporally
tempClus <- rep(0, m)
for (i in unique(dt[spatClus != 0, spatClus])) {
  Y <- runHDB(points = NULL,
              distMatrix = timeDistMat[dt$spatClus == i, dt$spatClus == i],
              minClusterSize = 2,
              minSampleSize = 100,
              epsDist = epsTime)
  # label
  tempClus[dt$spatClus == i] <- ifelse(Y != 0, paste0(i, ".", Y), 0)
}

# Set cluster column
dt[, "clus" := tempClus]
save(dt, file = "products/dt.RData")

The table above shows the results of the initial clustering. We see that some clusters include only two observations while others include more than 100, a result of HDBSCAN identifying clusters of varying densities. After computing an initial clustering, we screen out clusters that do not include consecutive observations, as they are not temporally convex (i.e., they are formed by combining two different times when there is evidence that the wolf was elsewhere in between). The following section describes our method to screen the clusters for evidence that the wolf loitered in the cluster location.
We begin screening by finding clusters with consecutive observations. We define the following R function to find the indices where each run of consecutive observations begins (values) and the length of each run (lengths), using a modified version of the R function available here (“Cran/Cgwtools. GitHub” n.d.).
# Function to find consecutive integers, and the length of each consecutive run
seqle <- function(x, incr = 1) {
  if (!is.numeric(x)) x <- as.numeric(x)
  n <- length(x)
  y <- x[-1L] != x[-n] + incr
  i <- c(which(y | is.na(y)), n)
  temp <- list(lengths = diff(c(0L, i)),
               values = x[head(c(0L, i) + 1L, -1L)])
  return(list(lengths = temp$lengths[temp$lengths > 1] - 1,
              values = temp$values[temp$lengths > 1]))
}

# Quick look
seqle(which(dt$clus == 1.1)) %>% as.data.table() %>% formattable(align = "l") %>% as.htmlwidget()

The table above is the result of our seqle function on one cluster. The values column indicates the index of the start of a consecutive segment of observations, and the corresponding lengths column indicates the number of consecutive observations that begin at the value index. We see that for cluster number 1.1 there are consecutive segments with over 1,000 observations.
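For readers following along in Python, a rough equivalent of seqle can be written with itertools.groupby. This sketch (our illustration, not part of the tutorial's code) returns the start index and full length of each run longer than one observation; note the modified seqle above reports run lengths minus one.

```python
from itertools import groupby

def runs_of_consecutive(indices):
    """Start index and full length of each run of consecutive integers,
    keeping only runs with more than one observation."""
    runs = []
    # Consecutive integers share the same value of (element - position)
    for _, grp in groupby(enumerate(indices), key=lambda t: t[1] - t[0]):
        run = [v for _, v in grp]
        if len(run) > 1:
            runs.append((run[0], len(run)))
    return runs

print(runs_of_consecutive([3, 4, 5, 9, 12, 13]))  # [(3, 3), (12, 2)]
```

The isolated index 9 is dropped, mirroring the screening of single-observation segments that follows.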
We now apply the seqle function to each cluster, filter out segments consisting of only one observation (no evidence of loitering), and calculate the loiter time for each consecutive segment in each cluster. We then filter out clusters that the wolf visited fewer than 10 times, and clusters where the wolf spent an average of less than 30 minutes per visit. Full code available here.
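The screening step (full code elided above) amounts to a simple filter over per-cluster visit durations. The sketch below is our hypothetical reconstruction with made-up durations: keep clusters with at least 10 visits and a mean loiter time of at least 30 minutes.

```python
def screen_clusters(visits, min_visits=10, min_avg_minutes=30):
    """Keep clusters with at least min_visits visits and a mean loiter
    time of at least min_avg_minutes; durations are per-visit minutes."""
    kept = {}
    for clus, durations in visits.items():
        avg = sum(durations) / len(durations)
        if len(durations) >= min_visits and avg >= min_avg_minutes:
            kept[clus] = {"visits": len(durations), "avg_minutes": avg}
    return kept

visits = {"1.1": [45] * 12,   # 12 visits averaging 45 min -> kept
          "2.1": [120] * 3,   # only 3 visits -> dropped
          "3.2": [5] * 20}    # 20 visits but 5 min average -> dropped
print(screen_clusters(visits))  # {'1.1': {'visits': 12, 'avg_minutes': 45.0}}
```

Both thresholds are analyst-chosen judgment calls and can be tuned to the subject and collection density.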
Now that we have stored the results in our data and a list object, we can use this information to plot using leaflet. The functions available here provide a comprehensive method to convey the results of our analysis. We omit the functions in this tutorial for brevity.
The map below conveys the results of clustering spatially and temporally using HDBSCAN. The loiter information for each cluster is shown when hovering over each cluster, and a radial heatmap of time of day by weekday is shown when clicking each cluster. We group the clusters by the time of day and include interactive toggles to allow the analyst to study the wolf’s pattern of life.
This tutorial demonstrates a method to extract actionable information from GPS data. A heatmap or hexagon-bin density map can provide initial intuition about the subject’s most frequented locations, but more granular information is typically necessary. An analyst viewing the raw observations alone can easily become overwhelmed trying to draw meaningful conclusions. Once we cluster spatially and temporally, we can extract reliable information about when, where, and for how long the subject (here, the wolf of interest) loiters.